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Abstract 


We propose to learn a kernel-based message operator which takes as input all ex¬ 
pectation propagation (EP) incoming messages to a factor node and produces an 
outgoing message. In ordinary EP, computing an outgoing message involves esti¬ 
mating a multivariate integral which may not have an analytic expression. Learn¬ 
ing such an operator allows one to bypass the expensive computation of the inte¬ 
gral during inference by directly mapping all incoming messages into an outgoing 
message. The operator can be learned from training data (examples of input and 
output messages) which allows automated inference to be made on any kind of 
factor that can be sampled. 


1 Background 


Existing approaches to automated probabilistic inference can be broadly divided into two categories 
(Heess et ah, 2013): uninformed and informed cases. In the uninformed case, the modeler has full 
freedom in expressing a probabilistic model without any constraint on the set of usable factors. This 
high flexibility comes at a price during inference as less factor-specific information is available to 
the inference engine. Often MCMC-based sampling techniques are employed by treating the factors 
as black boxes. In the informed case (Stan Development Team, 2014; Minka et ah, 2012), the 
modeler is required to build a model from constructs whose necessary computations are known to 
the inference engine. Although efficient during the inference, using an unsupported construct would 
require manual derivation and implementation of the relevant computations in the inference engine. 

In this work, we focus on EP, a commonly used approximate inference method. Following Heess 
et ah (2013), we propose to learn a kernel-based message operator for EP to capture the relationship 
between incoming messages to a factor and outgoing messages. The operator bridges the gap be¬ 
tween the uninformed and informed cases by automatically deriving the relevant computations for 
any custom factor that can be sampled. This hybrid approach gives the modeler as much flexibility 
as in the uninformed case while offering efficient message updates as in the informed case. This 
approach supports fast inference as no expensive KL divergence minimization needs to be done 
during inference as in ordinary EP. In addition, a learned operator for a factor is reusable in other 
models in which the factor appears. As will be seen, to send an outgoing message with the kernel- 
based operator, it is sufficient to generate a feature vector for incoming messages and multiply with 
a pre-learned matrix. Unlike Heess et ah (2013) which considers a neural network, the kernel-based 
message operator we propose can be easily extended to allow online updates of the operator during 
inference. 


‘Nicolas Heess is now affiliated with Google Deepmind. 
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2 Expectation Propagation (EP) 


Expectation propagation (Minka, 2001; Bishop, 2006) (EP) is a commonly used approximate infer¬ 
ence method for inferring the posterior distribution of latent variables given observations. In a typical 
directed graphical model, the joint distribution of the data A = {X \,..., X ,,} and latent variables 
9 = {9 1 ,..., 9t] takes the form of a product of factors, p(X , 9) = II;=i /t(A W) where each factor 
fi may depend on only a subset of X and 9. With A' observed, EP approximates the posterior with 
q(o) « ni: ,-j mf i ^ f e{9) where rrif i ^e is an approximate factor corresponding to ft with the con¬ 
straint that it has a chosen parametric form (e.g., Gaussian) in the exponential family (ExpFam). EP 
takes into account the fact that the final quantity of interest is the posterior q{9) which is given by 
the product of all approximate factors. In finding the i th approximate factor m.f^g, EP uses other 
approximate factors mg^f i (9) := Yl/ ii m fj^s(9) as a context to determine the plausible range 

ta CD ■. f 1 fi f U ■ t U (a\ pr°i[f d.X f(X\8)m x ^f i (X)m e ^ f (0)] 

of 9. EP iteratively refines m^-yg for each i with rrif^gid) = — k ^ 1 

where proj [r] = argmin 9e ExpFam KL [r || q] and (A) := <5(A — Aq) if X is observed to be 

A(). In the EP literature, mg^f i is known as a cavity distribution. 

The projection can be carried out by the following moment matching procedure. Assume an ExpFam 
distribution q(9\rf) = h(9) exp (r] T u{9) — A(ri )) where u(9) is the sufficient statistic of q, 77 is the 
natural parameter and A{rj) = log f d9 h(9) exp (j] r u(9 )) is the log-partition function. It can be 
shown that q* = proj [r] satisfies E ? *(g) [it(0)] = E r ( 0 ) [u(0)] . That is, the projection of r onto 
ExpFam is given by q* £ ExpFam that has the same moment parameters as the moments under r. 

In general, under the approximation that each factor fully factorizes, an EP message from a factor / 
to a variable V takes the form 


mf^v 0 ) = 


Proj [/ dV\{v} f(V) Ily'gv ™ V '^f{v')\ 


proj [r f ^y(v)] _ = qf^vjv) 
m V -yf(v) ’ m V -yf(v) 


where V = V(/) is the set of variables connected to / in the factor graph. In the previous case of 
nrif^e, we have V(/) = {A, 9} and V in Eq. 1 corresponds to 9. Typically, when the factor / is 
complicated, the integral defining becomes intractable. Quadrature rules or other numerical 

integration techniques are often applied to approximate the integral. 


3 Learning to Pass EP Messages 

Our goal is to learn a message operator Cf-yv 1 with signature [mv^f]vev(f) ^ which 

takes in all incoming messages {my^f I V £ V(/)} and outputs qf^,y>(v') i.e., the numerator of 
Eq. 1. For inference, we require one such operator for each recipient variable V' £ V(f) i.e., in total 
|V(/)| operators need to be learned for /. Operator learning is cast as a distribution-to-distribution 
regression problem where the training set Sy := {([TO^_^]ygv(/)j icontaining N 

incoming-outgoing message pairs can be generated as in Heess et al. (2013) by importance sam¬ 
pling to compute the mean parameters E r [w(V)] for moment matching. In principle, the 

importance sampling itself can be used in EP for computing outgoing messages (Barthelme and 
Chopin, 2011). The scheme is, however, expensive as we need to draw a large number of samples 
for each outgoing message to be sent. In our case, the importance sampling is used for data set 
generation which is done offline before the actual inference. 

The assumptions needed for the generation of a training set are as following. Firstly, we assume 
the factor / takes the form of a conditional distribution f(vi\v2, ■ ■ ■ ; t ’|V(/)|)- Secondly, given 
V 2 , ■ ■ ■, v\ v(/)|, vi can be sampled from /(• \v 2 , ..., f|v(/)|)- The ability to evaluate / is not as¬ 
sumed. Finally we assume that a distribution on the natural parameters of all incoming messages 
{my^yf ]vgy(/) i s available. The distribution is used solely to give a rough idea of incoming mes¬ 
sages the learned operator will encounter during the actual EP inference. In practice, we only need to 
ensure that the distribution sufficiently covers the relevant region in the space of incoming messages. 

In recent years, there have been a number of works on the regression task with distributional inputs, 
including Poczos et al. (2013); Szabo et al. (2014) which mainly focus on the non-parametric case 
and are operated under the assumption that the samples from the distributions are observed but not 
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the distributions themselves. In our case, the distributions (messages) are directly observed. More¬ 
over, since the distributions are in ExpFam, they can be characterized by a finite-dimensional natural 
parameter vector or expected sufficient statistic. Hence, we can simplify our task to distribution-to- 
vector regression where the output vector contains a finite number of moments sufficient to char¬ 
acterize qf^v 1 - As regression input distributions are in ExpFam, one can also treat the task as 
vector-to-vector regression. However, seeing the inputs as distributions allows one to use kernels on 
distributions which are invariant to parametrization. 

Once the training set Sy' is obtained, any distribution-to-vector regression function can be applied 
to learn a message operator Given incoming messages, the operator outputs <lf^v' from 

which the outgoing EP message is given by uif^v' = Qf^v 1 /my'-tf which can be computed an¬ 
alytically. We opt for kernel ridge regression (Scholkopf and Smola, 2002) as our message operator 
for its simplicity, its potential use in an online setting (i.e., incremental learning during inference), 
and rich supporting theory. 

3.1 Kernel Ridge Regression 

We consider here the problem of regressing smoothly from distribution-valued inputs to feature¬ 
valued outputs. We follow the regression framework of Micchelli and Pontil (2005), with conver¬ 
gence guarantees provided by Caponnetto and De Vito (2007). Under smoothness constraints, this 
regression can be interpreted as computing the conditional expectation of the output features given 
the inputs (Grunewalder et al., 2012). 

LetX = (xi| •• • | xjv) be the training regression inputs and Y = ^E g i u{v')\ ‘'' |E ? n £ 

be the regression outputs. The ridge regression in the primal form seeks W £ M. D y xD for 
the regression function g(x) = Wx which minimizes the squared-loss function J(W) = Y2iLi ||y* ~ 
Wxj||| + Atr (WW T ) where A is a regularization parameter and tr denotes a matrix trace. It is 
well known that the solution is given by W = YX T (XX T + A/) 1 which has an equivalent dual 
solution W = Y (K + A/) -1 X T = /IX . The dual formulation allows one to regress from any 
type of input objects if a kernel can be defined. All the inputs enter to the regression function 
through the gram matrix K £ R NxN where {K) rj = k(x.;, Xj) yielding the regression function of 

the form g(x) = din(xi,x) where A := (ai| • ■ • |ajv). The dual formulation therefore allows 

one to straightforwardly regress from incoming messages to vectors of mean parameters. Although 
this property is appealing, the training size N in our setting can be chosen to be arbitrarily large, 
making computation of g(x) expensive for a new unseen point x. To eliminate the dependency on 
N, we propose to apply random Fourier features (Rahimi and Recht, 2007) £(x) £ R ,) for x := 
[my^/]y £ v(/) suc h that k(x, x') « cj)(x) T <j)(x') where D is the number of random features. The 
use of the random features allows us to go back to the primal form of which the regression function 
g(cf>(x)) = \N(f>{x) can be computed efficiently. In effect, computing an EP outgoing message 
requires nothing more than a multiplication of a matrix W (D y x D ) with the //-dimensional 
feature vector generated from the incoming messages. 

3.2 Kernels on Distributions 

A number of kernels on distributions have been studied in the literature (Jebara and Kondor, 2003; 
Jebara et al., 2004). Relevant to us are kernels whose random features can be efficiently computed. 
Due to the space constraint, we only give a few examples here. 

Expected Product Kernel Let g r n ) := E r <i ), a \k(-,a) be the mean embedding (Smola et al., 
2007) of the distribution into RKHS 'H' r> induced by the kernel k. Assume k = k gau ss (Gaussian 
kernel) and assume there are c incoming messages x := (r^(aW))? =1 and y := (sW(6W))? =1 . 
The expected product kernel /t pro is defined as 

/ 0 c \ c 

Kpr° ( X >y) := ( 0 (VO , 0 (VO ) = n E ^ W («) E s W (b) fc gauss (M) » 0(x) T </>(y) 

\ Z=1 Z=1 / Zssl 
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Log KL divergence. Mean: -1.880, SD: 1.827 



Figure 1: Log KL-divergence on a logistic factor test set using kernel on joint embeddings. 

where <^(x) T 0(y) = ^(0(r(0) T <£(0(g(0). The feature map <^( J )( r (0) can be estimated by 

applying the random Fourier features to kgJiss and taking the expectations E r (i)( a )E s ( 0 ( 6 ). The final 
feature map is </>(x) = 0 W (j-W) © (A 2 ^)®---® </^ (A c ^) G M d where © denotes a Kronecker 
product and we assume that G E d for l G {1,..., c}. 

Kernel on Joint Embeddings Another way to define a kernel on x, y is to mean-embed both joint 
distributions r = n-= . A 1 ' 1 and s and define the kernel to be Kj 0 int(x, y) := {fi r , /z s )c 

where Q is an RKHS consisting of functions g : X (1 ' x • • • x X 1 ^ —> E and X ' 1 ' denotes the domain 
of A 1 ' and s'' 1 ’ . This kernel is equivalent to the expected product kernel on the joint distributions. 

4 Experiment 

As a preliminary experiment, we consider the logistic factor f(z\x) = 5 (^z — 1+eX p(_ a .) ^ which is 

deterministic when conditioning on x. The factor is used in many common models including binary 
logistic regression. The two incoming messages are mx^f{x) = M(x\ n, er 2 ) and mz-yf{z) = 
Beta(z;a,/?). We randomly generate 2000 training input messages and learn a message operator 
using the kernel on joint embeddings. Kernel parameters are chosen by cross validation and the 
number of random features D is set to 2000. We report log A'L[g||g] where q = qf^x is the 
ground truth output message obtained by importance sampling and q is the message output from the 
operator. For better numerical scaling, regression outputs are set to (E g [a:] ,logV 9 [x]) instead of 
the expectations of the first two moments. The histogram of log KL errors is shown on the left of 
Fig. 1. The right figure shows sample output messages at different log KL errors. It can be seen 
that the operator is able to capture the relationship of incoming and outgoing messages. With higher 
training size, increased number of random features and well chosen kernel parameters, we expect to 
see a significant improvement in the operator’s accuracy. 

5 Conclusion and Future Work 

We propose to learn to send EP messages with kernel ridge regression by casting the KL mini¬ 
mization problem as a supervised learning problem. With random features, incoming messages to 
a learned operator are converted to a finite-dimensional vector. Computing an outgoing message 
amounts to computing the moment parameters by multiplying the vector with a matrix given by the 
solution of the primal ridge regression. 

By virtue of the primal form, it is straightforward to derive an update equation for an online-active 
learning during EP inference: if the predictive variance (similar to a Gaussian process) on the cur¬ 
rent incoming messages is high, then we query the outgoing message from the importance sampler 
(oracle) and update the operator. Otherwise, the outgoing message is efficiently computed by the 
operator. Online learning of the operator lessens the need of the distribution on natural parameters 
of incoming messages used in training set generation. Determining an appropriate distribution was 
one of the unsolved problems in Heess et al. (2013). 
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